class 4 - lecture notes ----------------------- problems in bioinformatics (overview) - sequencing genomes: physical mapping, sequence assembly concrete problem: sequence assembly problem: given: fragments of a large DNA sequence (e.g., human chromosome) with overlaps (multiple coverage) want: entire sequence complicating factors: - computational complexity: problem can be seen as variation of shortest common superstring problem which is known to be NP-hard (see, e.g., http://citeseer.nj.nec.com/kececioglu93combinatorial.html) - incorrect/missing nucleotides ion fragment data (1-10%) - sequence repeats solution strategy: - determine fragment overlaps - orient and approx layout fragments based on overlapss - use multiple seq alignment of fragments to infer sequence (see http://www.dkfz-heidelberg.de/tbi/bioinfo/PhysicalMapping/SeqAssembly/, **http://www.biostat.harvard.edu/~dscholte/ieeealgs.pdf ) - interpreting genomic sequence data: sequence alignment, gene finding, sequence annotation (intron/exons, promotors, specific rna genes), structure prediction (rna and proteins), pattern discovery, classification, clustering concrete problem: rna secondary structure prediction problem given: an RNA sequence X want: secondary structure of X (base pairings in X) complications: - factors determining secondary structure are complex and not entirely understood - secondary structure of X may not be uniquely determined - computational complexity - understanding the cell / organisms: understanding interactions between proteins, dna, rna; regulatory pathways and networks, simulations concrete problem: gene regulatory relationship inference given: expression profiles of two genes A,B (e.g., from microarray analysis) want: decide whether A activates B, B activates A, A inhibits B, B inhibits A, or there is no (direct) regulatory relationship between A and B complications: - imprecision/limitations in measuring expression profiles - indirect / complex regulatory relationships - relations between organisms and evolutionary questions: phylogenetic trees, computational models of evolution, simulations concrete problem: phylogenetic tree inference given: homologous dna sequences from multiple species want: evolutionary tree relating these species complications: - errors in sequence - complexity/quality of multiple sequence alignment - limited knowledge of evolutionary processes - biomolecular computing: models of biomolecular computing: e.g., p-systems dna computing, inverse folding, self-assembling structures concrete problem: design of self-assembling dna nanostructures given: desired structure formed by multiple strands of dna want: find sequences for each strand (building block) such that entire structure forms spontaneously complications: - errors in self-assembly (undesired interactions between strands) - efficiency of self-assembly process - related aspects (often not counted as bioinformatics) - simple tools (simple motiv search, dna-rna-protein seq translations, ...) - computational models of organisms or biological systems (biocybernetics) - nature-inspired algorithms (genetic algorithms, neural networks, ant colony optimisation, ...) - artificial life (two flavours: life-like behaviour of artificial systems, (re-)design of biological organisms) - three views on bioinformatics: - cs just provides tools for biologists / biochemists problems: lack of biological knowledge can lead to weaknesses in tools lack of cs understanding can lead to questionable application of tools - cs research motivated by biological problems problem: lack of biological knowledge + overabstraction leads to algorithms/results with limited/no biological application - synergistic, interdisciplinary research problem: needs close collaboration of experts in cs and molecular biology, broad common basis of cs and basic biological knowledge but: high potential for innovative and intriguing research and applications (last 15min of class: project group formation / informal discussion)